Introduction

For this particular project, we are looking at the stock prices of three major tech companies: Apple, Facebook and Google, from the years of 2012 to 2018. We got access to the dates, opening S&P 500 indices and closing prices of these companies from Yahoo Finance.

Based on the nature of stock market, there can potentially be temporal structures when analyzing and predicting stock prices. We took the closing price as the response variable and tried to fit appropriate models with the S&P 500 indices as the mean structure to help predict the closing prices each day for each company, as well as understanding the temporal structures within.

Due to the fact that all of the companies share the same timeline and have data of their own at the same time points, we would have to fit three individual models for each of them. We will first look at some Explanatory Data Analyses for all three companies, then try to fit simpler models with no temporal structures, and finally fit and evaluate temporal models for each company. For the temporal model fitting part specifically, we will try two different methods: both auto-fitting ARIMA models, as well as models with Gaussian Process. We then compared the performances of all three types models fit by both methods.

Lastly, we wish to come to a conclusion for our questions of interest: can we predict the closing prices of stocks with the S&P 500 index and date solely in a simple linear model? Is there any temporal dependency in the closing prices? Are there differences among different companies or do they share similar trends and structures in their stocks? Can this trend or model, potentially with the temporal structure, potentially be generalized to other companies? We would also have a discussion on the adequacy, potential problems with the models and provide suggestions for developing this project in the future.

Exploratory Data Analysis

summary(df)
##       Date                 Open         fb_close       google_close   
##  Min.   :2012-05-18   Min.   :1278   Min.   : 17.73   Min.   : 277.7  
##  1st Qu.:2013-11-11   1st Qu.:1765   1st Qu.: 49.50   1st Qu.: 504.1  
##  Median :2015-05-06   Median :2033   Median : 81.73   Median : 581.6  
##  Mean   :2015-05-06   Mean   :2003   Mean   : 90.37   Mean   : 641.2  
##  3rd Qu.:2016-10-25   3rd Qu.:2178   3rd Qu.:124.94   3rd Qu.: 779.6  
##  Max.   :2018-04-20   Max.   :2867   Max.   :193.09   Max.   :1175.8  
##   apple_close    
##  Min.   : 55.79  
##  1st Qu.: 81.64  
##  Median :105.80  
##  Mean   :108.11  
##  3rd Qu.:126.81  
##  Max.   :181.72
#time trend
p1 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = fb_close)) +
  geom_line() +
  ggtitle("Stock Close Price of Facebook")
p2 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = apple_close)) +
  geom_line() +
  ggtitle("Stock Close Price of Apple")
p3 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = google_close)) +
  geom_line() +
  ggtitle("Stock Close Price of Google")
p4 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = Open)) +
  geom_line() +
  ggtitle("S&P 500 Index Open")
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

#scatter plot
p1 = ggplot(data = df, aes(x = Open, y = fb_close)) +
  geom_point(size = 0.1) +
  ggtitle("Close~Open for Facebook")
p2 = ggplot(data = df, aes(x = Open, y = apple_close)) +
  geom_point(size = 0.1) +
  ggtitle("Close~Open for Apple")
p3 = ggplot(data = df, aes(x = Open, y = google_close)) +
  geom_point(size = 0.1) +
  ggtitle("Close~Open for Google")
grid.arrange(p1, p2, p3, nrow = 2, ncol = 2)

We mainly focused on three variables Date, Open (Opening S&P indices) and Close (Closing stock price) for all three companies.

The line plots of all of the closing prices of the three companies against date, as well as the S&P 500 index against date, show linear trend as time goes by. This is an expected trend as the stock market is continuously inflating, especially since the tech companies have been rapidly developing during the time period that we’re looking at. Similarly, the S&P 500 index has an increasing the linear trend. Thus, by using the indices as a predictor/mean structure, we’re able to de-trend the linear trend in the closing prices and then move on to looking at any remaining temporal structure.

Then we look at scatter plots of all three companies with the closing price against the open S&P 500 indices. Even though there is a bit of discrepancy in the trend shown by Apple, it is more or less closer to a linear trend. Thus, the scatter plots further confirmed using the S&P 500 index as the mean structure.

Method 1: Simple Linear Models

The first method is fitting a simple linear model for each individual company. We used the S&P 500 index and the date of which the index and closing prices are collected to predict daily closing price for all three companies separately.

#Apple
apple_lm = lm(apple_close ~ Open + Date, data = df)
summary(apple_lm)
## 
## Call:
## lm(formula = apple_close ~ Open + Date, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.735 -12.223  -0.845  10.784  35.642 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -67.379897  27.767662  -2.427   0.0154 *  
## Open          0.076482   0.003608  21.197   <2e-16 ***
## Date          0.001346   0.002091   0.644   0.5197    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.17 on 1487 degrees of freedom
## Multiple R-squared:  0.8017, Adjusted R-squared:  0.8014 
## F-statistic:  3005 on 2 and 1487 DF,  p-value: < 2.2e-16
df$apple_naive_pred = predict(apple_lm, data=df$Open)
df$apple_naive_residual = df$apple_close - df$apple_naive_pred
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = apple_close, color = "red")) +
  geom_line(aes(y = apple_naive_pred, color = "blue")) +
  xlab("time") +
  ylab("Apple Daily Close Stock Price") +
  ggtitle("Simple Linear Model")

rmse(df$apple_close, df$apple_naive_pred)
## [1] 14.15821
rmse(df$apple_close[lower.b:upper.b], df$apple_naive_pred[lower.b:upper.b])
## [1] 11.45351
#Facebook
fb_lm = lm(fb_close ~ Open + Date, data = df)
summary(fb_lm)
## 
## Call:
## lm(formula = fb_close ~ Open + Date, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.215  -5.594  -0.558   4.397  33.428 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.024e+03  1.689e+01  -60.62   <2e-16 ***
## Open         2.239e-02  2.195e-03   10.20   <2e-16 ***
## Date         6.458e-02  1.272e-03   50.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.621 on 1487 degrees of freedom
## Multiple R-squared:  0.969,  Adjusted R-squared:  0.9689 
## F-statistic: 2.321e+04 on 2 and 1487 DF,  p-value: < 2.2e-16
df$fb_naive_pred = predict(fb_lm, data=df$Open)
df$fb_naive_residual = df$fb_close - df$fb_naive_pred
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = fb_close, color = "red")) +
  geom_line(aes(y = fb_naive_pred, color = "blue")) +
  xlab("time") +
  ylab("Facebook Daily Close Stock Price") +
  ggtitle("Simple Linear Model")

rmse(df$fb_close, df$fb_naive_pred)
## [1] 8.612343
rmse(df$fb_close[lower.b:upper.b], df$fb_naive_pred[lower.b:upper.b])
## [1] 11.61579
#Google
google_lm = lm(google_close ~ Open + Date, data = df)
summary(google_lm)
## 
## Call:
## lm(formula = google_close ~ Open + Date, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -153.400  -27.964    8.295   33.648  142.760 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.451e+03  1.061e+02  -32.53   <2e-16 ***
## Open         2.014e-01  1.378e-02   14.61   <2e-16 ***
## Date         2.228e-01  7.987e-03   27.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.14 on 1487 degrees of freedom
## Multiple R-squared:  0.9377, Adjusted R-squared:  0.9376 
## F-statistic: 1.12e+04 on 2 and 1487 DF,  p-value: < 2.2e-16
df$google_naive_pred = predict(google_lm, data=df$Open)
df$google_naive_residual = df$google_close - df$google_naive_pred
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = google_close, color = "red")) +
  geom_line(aes(y = google_naive_pred, color = "blue")) +
  xlab("time") +
  ylab("Google Daily Close Stock Price") +
  ggtitle("Simple Linear Model")

rmse(df$google_close, df$google_naive_pred)
## [1] 54.08638
rmse(df$google_close[lower.b:upper.b], df$google_naive_pred[lower.b:upper.b])
## [1] 61.21239

Method 2: ARIMA Time Series Models

We then proceeded to fit ARIMA time series models for each of the three companies of interest. The model is specified as follows:

\[ \begin{aligned} Close_t &= ARIMA_{(p, q, d) \times (P, Q, D)_s}(Close_{t-1, \cdots}) + \beta_1 * Open_{(S\&P_500)} \end{aligned} \]

Mean structure was added to the models with the open price of S&P apart from the ARIMA time series in predicting Close. We first fitted auto.arima models on all of the three companies and examined their residual plots for further model improvements.

The residual plot for Apple after fitting the auto ARIMA(2,1,0) model looks approximately randomized without salient spikes, so we decided to go with the result. As for Facebook, through close examination of the residual plot after fitting auto ARIMA(2,1,2), we found spikes at period 30. We thus tried to add a seasonal AR or MA trend to the model for potential improvements. However, the residual plots after adding seasonal terms did not show any sign of better performance of the model. What is more, we noticed that the autocorrelation with lag 30 is about 0.1, which is relatively small. So we decided to stick with the result from auto.arima with model ARIMA(2,1,2) for Facebook. We then examined the plots of the auto ARIMA(2,1,2) model for Google: From ACF/PACF plots, we found spikes and potential periodic trend. Thus we treid out different ways of adding seasonal terms to the original ARIMA(2,1,2) in an attempt to get a better model. Yet by evaluating residual plots, adding seasonal terms don’t seem to give a better performance. Plus, we notice that autocorrelation with lag 30 is about 0.1, which is relatively small. Taking all of the above into accound, we eventually decided to stick with the result from auto.arima as well for Google.

The resulting plots of actual closing price and the time series predicted price over Date give fairly close values at each time stamp for all of the three companies, which suggests that ARIMA models are doing a decent job in predicting Close. Also, the root mean squared error(RMSE) for the models of Apple and Facebook, in evaluating the model performances, are both in the range of 1.6~1.7, a fairly small RMSE especially in comparison to the previous simple linear regression. The RMSE for the ARIMA model of Google is slightly higher (9.3), but also suggests an acceptable prediction performance.

#Apple
apple.ts = auto.arima(df %>% select(apple_close), xreg = df$Open, seasonal = TRUE)
apple.ts %>% summary()
## Series: df %>% select(apple_close) 
## Regression with ARIMA(2,1,0) errors 
## 
## Coefficients:
##          ar1      ar2    xreg
##       0.0092  -0.0491  0.0053
## s.e.  0.0303   0.0262  0.0033
## 
## sigma^2 estimated as 2.676:  log likelihood=-2844.26
## AIC=5696.52   AICc=5696.54   BIC=5717.74
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.05806906 1.633794 1.149067 0.03724077 1.097862 0.9989548
##                      ACF1
## Training set -0.001345197
ggtsdisplay(apple.ts$residuals, main = "ARIMA(2,1,0)")

#Residual plot looks good and we will go with the result.
df$apple_ts_pred = c(apple.ts$fitted)
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = apple_close, color = "red")) +
  geom_line(aes(y = apple_ts_pred, color = "blue")) +
  xlab("time") +
  ylab("Apple Daily Close Stock Price") +
  ggtitle("ARIMA(2,1,0)")

rmse(df$apple_close, df$apple_ts_pred)
## [1] 1.633794
rmse(df$apple_close[lower.b:upper.b], df$apple_ts_pred[lower.b:upper.b])
## [1] 2.192759
#Facebook
facebook.ts = auto.arima(df %>% select(fb_close), xreg = df$Open, seasonal = TRUE)
facebook.ts %>% summary()
## Series: df %>% select(fb_close) 
## Regression with ARIMA(2,1,2) errors 
## 
## Coefficients:
##           ar1     ar2     ma1      ma2   drift    xreg
##       -0.0576  0.8242  0.0539  -0.8907  0.0860  0.0021
## s.e.   0.0700  0.0682  0.0574   0.0571  0.0306  0.0031
## 
## sigma^2 estimated as 2.82:  log likelihood=-2881.82
## AIC=5777.64   AICc=5777.72   BIC=5814.78
## 
## Training set error measures:
##                        ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.002329131 1.675428 1.120526 -0.1057361 1.517573 0.9969723
##                       ACF1
## Training set -0.0004667628
ggtsdisplay(facebook.ts$residuals, main = "ARIMA(2,1,2)")

#After looking at the residual plot, we found spikes at period 30 and tried to see if having a seasonal AR or MA trend makes the model better.
facebook.try1 = Arima(df %>% select(fb_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 0, 1),period = 30))
ggtsdisplay(facebook.try1$residuals)

facebook.try2 = Arima(df %>% select(fb_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(1, 0, 0),period = 30))
ggtsdisplay(facebook.try2$residuals)

#However, by evaluating residual plots, adding seasonal terms don't seem to give a better performance. Plus, we can see that autocorrelation with lag 30 is about 0.1, which is relatively small. So, we decided to stick with the result from auto.arima.
df$fb_ts_pred = c(facebook.ts$fitted)
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = fb_close, color = "red")) +
  geom_line(aes(y = fb_ts_pred, color = "blue")) +
  xlab("time") +
  ylab("Facebook Daily Close Stock Price") +
  ggtitle("ARIMA(2,1,2)")

rmse(df$fb_close, df$fb_ts_pred)
## [1] 1.675428
rmse(df$fb_close[lower.b:upper.b], df$fb_ts_pred[lower.b:upper.b])
## [1] 2.601592
#Google
google.ts = auto.arima(df %>% select(google_close), xreg = df$Open, seasonal = TRUE)
google.ts %>% summary()
## Series: df %>% select(google_close) 
## Regression with ARIMA(2,1,2) errors 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2   drift    xreg
##       -0.5939  -0.8918  0.5656  0.8422  0.4368  0.0666
## s.e.   0.1005   0.0927  0.1259  0.1076  0.2345  0.0197
## 
## sigma^2 estimated as 87.05:  log likelihood=-5435.1
## AIC=10884.2   AICc=10884.28   BIC=10921.34
## 
## Training set error measures:
##                      ME     RMSE      MAE         MPE    MAPE      MASE
## Training set 0.02345534 9.307922 6.106312 -0.01130765 0.96099 0.9965447
##                     ACF1
## Training set 0.008764646
ggtsdisplay(google.ts$residuals, main = "ARIMA(2,1,2)")

#From ACF/PACF plots, we found spikes and potential periodic trend so we tried to add seasonal terms to the model.
google.try1 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 1, 0),period = 7))
ggtsdisplay(google.try1$residuals)

google.try2 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(1, 1, 0),period = 7))
ggtsdisplay(google.try2$residuals)

google.try3 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 1, 1),period = 7))
ggtsdisplay(google.try3$residuals)

#However, after trying different combinations of seasonal terms, we found that ACF/PACF haven't improved much. Plus, autocorrelation value relatively low. So we decided to go with the result from auto.arima function.
df$google_ts_pred = c(google.ts$fitted)
ggplot(data = df, aes(x = Date)) +
  geom_line(aes(y = google_close, color = "red")) +
  geom_line(aes(y = google_ts_pred, color = "blue")) +
  xlab("time") +
  ylab("Google Daily Close Stock Price") +
  ggtitle("ARIMA(2,1,2)")

rmse(df$google_close, df$google_ts_pred)
## [1] 9.307922
rmse(df$google_close[lower.b:upper.b], df$google_ts_pred[lower.b:upper.b])
## [1] 13.79383

Method 3: Gaussian Process

\[ Close_t = \beta X + w_{t} \\ w_{t} \sim GP(0,\Sigma)\\ \Sigma \sim square \ exponential \] Because it takes a long time for JAGS to run large datasets, we subset the dataset to a year of data from 2017-4-21 to 2018-4-20.

###Apple

apple_emp_cloud = subset %>% emp_semivariogram(apple_naive_residual,Date)
apple_emp = rbind(
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=5)  %>% mutate(binwidth="binwidth=5"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=20) %>% mutate(binwidth="binwidth=20"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=40)   %>% mutate(binwidth="binwidth=40"),
  subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=30)  %>% mutate(binwidth="binwidth=30")
)
apple_emp %>%
  ggplot(aes(x=h, y=gamma)) +
  geom_point(size = 1) +
  ggtitle("Empirical Semivariogram of Apple (binned)")+
  facet_wrap(~binwidth, nrow=2)

apple_gp_exp_model = "model{
  y ~ dmnorm(mu, inverse(Sigma))
  for (i in 1:N) {
    mu[i] <- beta[1]+ beta[2] * x[i]
  }
  
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }
  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + sigma2_w
  }
  for (i in 1:2) {
    beta[i] ~ dt(coef[i], 2.5, 1)
  }
  sigma2_w ~ dnorm(10, 1/25) T(0,)
  sigma2   ~ dnorm(50, 1/25) T(0,)
  l        ~ dt(0,2.5,1) T(0,) 
}"
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
  ungroup() %>%
  mutate(.variable = paste0(.variable, "[",i,"]")) %>%
  select(-i)
betas %>%
  group_by(.variable) %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales = "free_y")

params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales="free_y")

params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

params %>%
  slice(seq(1,n(),length.out=500)) %>% 
  filter(.variable == "l") %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    scale_x_log10() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

post = bind_rows(betas, params) %>%
  group_by(.variable) %>%
  summarize(
    post_mean = mean(.value),
    post_med  = median(.value),
    post_lower = quantile(.value, probs = 0.025),
    post_upper = quantile(.value, probs = 0.975)
  )
knitr::kable(post, digits = 5)
.variable post_mean post_med post_lower post_upper
beta[1] 59.63543 55.80959 37.46380 89.79846
beta[2] 0.03976 0.04138 0.02815 0.04903
l 0.09253 0.09374 0.07121 0.11468
sigma2 48.76159 48.84577 39.31196 56.76181
sigma2_w 2.96200 2.92310 2.34422 3.71781
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(apple_gp_resid = apple_close - beta0 - beta1 * Open)
reps=1000
x = subset$Open
y = subset$apple_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(252, 0.01)
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o  = sq_exp_cov(dist_o,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p  = sq_exp_cov(dist_p,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu  = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
apple_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
apple_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
  geom_line(aes(y=apple_close)) +
  geom_ribbon(data=apple_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
  geom_line(data=apple_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) + 
  ylab("Apple Daily Close Stock Price")

#use mean as the predicted value from bayes to calculate RMSE
subset$apple_gp_pred = apple_pred_df_bayes$post_mean
rmse(subset$apple_close, subset$apple_gp_pred)
## [1] 1.634382

###Facebook

fb_emp_cloud = subset %>% emp_semivariogram(fb_naive_residual,Date)
fb_emp = rbind(
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=1)  %>% mutate(binwidth="binwidth=1"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=5) %>% mutate(binwidth="binwidth=5"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=15)   %>% mutate(binwidth="binwidth=15"),
  subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=30)  %>% mutate(binwidth="binwidth=30")
)
fb_emp %>%
  ggplot(aes(x=h, y=gamma)) +
  geom_point(size = 1) +
  ggtitle("Empirical Semivariogram of Facebook (binned)")+
  facet_wrap(~binwidth, nrow=2)

fb_gp_exp_model = "model{
  y ~ dmnorm(mu, inverse(Sigma))
  for (i in 1:N) {
    mu[i] <- beta[1]+ beta[2] * x[i]
  }
  
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }
  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + sigma2_w
  }
  for (i in 1:2) {
    beta[i] ~ dt(coef[i], 2.5, 1)
  }
  sigma2_w ~ dnorm(10, 1/25) T(0,)
  sigma2   ~ dnorm(390, 1/200) T(0,)
  l        ~ dt(0,2.5,1) T(0,) 
}"
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
  ungroup() %>%
  mutate(.variable = paste0(.variable, "[",i,"]")) %>%
  select(-i)
betas %>%
  group_by(.variable) %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales = "free_y")

params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales="free_y")

params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

params %>%
  slice(seq(1,n(),length.out=500)) %>% 
  filter(.variable == "l") %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    scale_x_log10() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

post = bind_rows(betas, params) %>%
  group_by(.variable) %>%
  summarize(
    post_mean = mean(.value),
    post_med  = median(.value),
    post_lower = quantile(.value, probs = 0.025),
    post_upper = quantile(.value, probs = 0.975)
  )
knitr::kable(post, digits = 5)
.variable post_mean post_med post_lower post_upper
beta[1] 138.00653 137.23548 92.06659 190.34675
beta[2] 0.01138 0.01184 -0.00740 0.03109
l 0.06331 0.06302 0.05771 0.07039
sigma2 385.32748 384.58300 357.71997 417.85007
sigma2_w 4.45172 4.45002 3.73450 5.41662
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(fb_gp_resid = fb_close - beta0 - beta1 * Open)
reps=1000
x = subset$Open
y = subset$fb_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(252, 0.01)
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o  = sq_exp_cov(dist_o,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p  = sq_exp_cov(dist_p,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu  = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
fb_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
fb_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
  geom_line(aes(y=fb_close)) +
  geom_ribbon(data=fb_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
  geom_line(data=fb_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) + 
  ylab("Facebook Daily Close Stock Price")

#use mean as the predicted value from bayes to calculate RMSE
subset$fb_gp_pred = fb_pred_df_bayes$post_mean
rmse(subset$fb_close, subset$fb_gp_pred)
## [1] 1.970619

###Google

google_emp_cloud = subset %>% emp_semivariogram(google_naive_residual,Date)
google_emp = rbind(
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=1)  %>% mutate(binwidth="binwidth=1"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=5) %>% mutate(binwidth="binwidth=5"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=15)   %>% mutate(binwidth="binwidth=15"),
  subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=30)  %>% mutate(binwidth="binwidth=30")
)
google_emp %>%
  ggplot(aes(x=h, y=gamma)) +
  geom_point(size = 1) +
  ggtitle("Empirical Semivariogram of Google (binned)")+
  facet_wrap(~binwidth, nrow=2)

google_gp_exp_model = "model{
  y ~ dmnorm(mu, inverse(Sigma))
  for (i in 1:N) {
    mu[i] <- beta[1]+ beta[2] * x[i]
  }
  
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
      Sigma[j,i] <- Sigma[i,j]
    }
  }
  for (k in 1:N) {
    Sigma[k,k] <- sigma2 + sigma2_w
  }
  for (i in 1:2) {
    beta[i] ~ dt(coef[i], 2.5, 1)
  }
  sigma2_w ~ dnorm(100, 1/100) T(0,)
  sigma2   ~ dnorm(700, 1/400) T(0,)
  l        ~ dt(0,2.5,1) T(0,) 
}"
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
  ungroup() %>%
  mutate(.variable = paste0(.variable, "[",i,"]")) %>%
  select(-i)
betas %>%
  group_by(.variable) %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales = "free_y")

params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.iteration, y=.value, color=.variable)) +
    geom_line() +
    facet_grid(.variable~., scales="free_y")

params %>%
  slice(seq(1,n(),length.out=500)) %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

params %>%
  slice(seq(1,n(),length.out=500)) %>% 
  filter(.variable == "l") %>%
  ggplot(aes(x=.value, fill=.variable)) +
    geom_density() +
    scale_x_log10() +
    facet_wrap(~.variable, scales="free") +
    guides(fill=FALSE)

post = bind_rows(betas, params) %>%
  group_by(.variable) %>%
  summarize(
    post_mean = mean(.value),
    post_med  = median(.value),
    post_lower = quantile(.value, probs = 0.025),
    post_upper = quantile(.value, probs = 0.975)
  )
knitr::kable(post, digits = 5)
.variable post_mean post_med post_lower post_upper
beta[1] -498.77787 -500.77321 -505.52283 -468.68757
beta[2] 0.58398 0.58450 0.57281 0.58877
l 0.09541 0.09349 0.07905 0.12334
sigma2 697.70238 697.63588 663.18587 732.68595
sigma2_w 126.16484 126.77453 113.52041 139.03941
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(google_gp_resid = google_close - beta0 - beta1 * Open)
reps=1000
x = subset$Open
y = subset$google_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(252, 0.01)
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o  = sq_exp_cov(dist_o,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p  = sq_exp_cov(dist_p,  sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu  = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
google_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
google_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
  geom_line(aes(y=google_close)) +
  geom_ribbon(data=google_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
  geom_line(data=google_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) + 
  ylab("Google Daily Close Stock Price")

#use mean as the predicted value from bayes to calculate RMSE
subset$google_gp_pred = google_pred_df_bayes$post_mean
rmse(subset$google_close, subset$google_gp_pred)
## [1] 12.26636

Conclusion

subset %>% summarise(method = "rmse",
                     google.lm = rmse(google_close, google_naive_pred),
                     google.arima = rmse(google_close, google_ts_pred),
                     google.gp = rmse(google_close, google_gp_pred),
                     fb.lm = rmse(fb_close, fb_naive_pred),
                     fb.arima = rmse(fb_close, fb_ts_pred),
                     fb.gp = rmse(fb_close, fb_gp_pred),
                     apple.lm = rmse(apple_close, apple_naive_pred),
                     apple.arima = rmse(apple_close, apple_ts_pred),
                     apple.gp = rmse(apple_close, apple_gp_pred))
##   method google.lm google.arima google.gp    fb.lm fb.arima    fb.gp
## 1   rmse  61.21239     13.79383  12.26636 11.61579 2.601592 1.970619
##   apple.lm apple.arima apple.gp
## 1 11.45351    2.192759 1.634382

After the series of model fitting and analyses, we are now able to answer the questions of interest that we brought up earlier in the introduction section about this particular case, and potentially lead to more conclusions and discoveries.

First of all, even though the overall trend in the data as time goes by show an increasing linear trend, it is obvious that there is more structure within the data than a simple linear trend, and these structures need to be explained to make more precise predictions and inferences. By looking at the residual structures, ACF and PACF plots of all three companies’ closing prices, we were able to conclude that the time series trend in Close does exist. With a temporal structure implemented, we should be able to explain more variation in the data than simply a linear model with 2 predictors.

Therefore, when we later fit both ARIMA and Gaussian Process models, their performances have both significantly improved and were both able to, to different degrees, explain the time variability. Both the plot with fitted values from these models and the significantly decreasing RMSE values show that, models taking into account the time series effect are better candidates for modeling fitting in this case. This result matched with our initial intuition that, because of the nature of the stock market, the prices of a certain day would have some dependencies on the previous days due to the overall business performance trend of a certain company in the stock market.

Now if we take a closer look at both of the time series models, even though the differences between the two are very minor comparing to how much they’ve both improved from the linear model, the Gaussian Process model has slightly outperformed the ARIMA model. The reason for this slight improvement, we suspect, is that due to the fact that stock markets are closed on weekends thus there is not data for weekends, the discrete time assumption for the ARIMA model is weakened. However, Gaussian process considers the distance between days in a continuous way instead of a discrete way like ARIMA, and therefore constructs a more expressive correlation between days.

Even though the Gaussian Process model is the best-performing one of the three in the previous analyses, it is still extremely computationally costly, thus we were not able to eventually fit more observations, but only subset one year worth of data. This sacrifice diminishes the model’s predictive capacity, even though the precision right now if the highest amon all three. This, on the other hand, is an advantage of ARIMA model because it requires shorter training time while maintaining an acceptable amount of tradeoff of increase in error.

Overall, our models have successfully discovered and explained some of the more complex time series variation in the data besides explaining the obvious increasing linear trend. There are pros and cons to both time series models, and the choice of which method to use depends on the decision of whether computation simplicity or model fitting precision is valued more heavily. In terms of the conclusion on the three companies we look at specifically, since they are all tech companies continuously growing and developing over the past decade, they share many traits in common (tech-based backgrounds, dependencies on the growth of internet and etc.), it is reasonable that they also share similar patterns of growth in terms of their performances in the stock market. Since the stock market, as well as the S&P 500 index all show trend of continuously increasing and inflating, Facebook, Apple and Google happen to match with the overall trend of the predictor. However, we cannot simply generalize this model to all companies on the stock market, because not all industries are having their booming moments like the ones we studied on for this particular case. More factors may contribute to the pricing of companies from other industries, and thus should be looked at separately.

In the next section, we will discuss some of the limitations and suggestions for our final model.

Discussion (Limitations and Suggestions)

  1. Validation and testing. Due to the fact we were testing on the training data, overfitting may be a main concern. If overfitting exists, it will potentially cause issues in extrapolation. Thus we need more data for validation and testing in order to conclude on the performance of the models more precisely and responsibly.

  2. As mentioned above, the computational complexity and costs for fitting a Gaussian Process model are very high, thus we were not able to fit as much data into the model as the other two methods. In order to fit more data into the Gaussian process, we may need to use parallel computing or a more powerful CPU. Since Gaussian Process is an infinite neural network, Recurrent Neural Network may be a solution to this dilemma.

  3. Even though Gaussian Process is chosen over ARIMA in this specific case, ARIMA still has its advatanges over other methods. If ecountered a case with the time series being more strictly discrete, ARIMA model may be a better option considering its strength in complexity and prediction.